Kristian K Ullrich
Max Planck Institute for Evolutionary Biology
A matter of distance.
Synonymous and NonSynonymous Substitutions: nucleotide substitutions might lead to changes on amino acid level.
Typically used to:
An R/Bioconductor package to calculate pairwise distances between all sequences of a MultipleSequenceAlignment (DNAStringSet or AAStringSet) and to conduct codon based analysis.
MSA2dist Features:
From Bioconductor:
From GitHub:
Data: Whole-genome coding sequences and gene ranges for Chlorophyta algae + Physcomitrium patens as an outgroup.
Source: Pico-PLAZA 3.0
Where to find: the data/ directory in the GitHub repo associated with this presentation.
data
├── clusters.rda
├── codingsequences.rda
└── data_acquisition_cds.md
codingsequences.rda: DNAStringSet object.clusters.rda: Pre-calculated syntenet clusters.codingsequences# Load necessary libraries
library(MSA2dist)
library(Biostrings)
library(dplyr)
library(stringr)
# Load data set codingsequences
load(here::here("data", "codingsequences.rda"))
# Inspect data
head(names(codingsequences))[1] "AP00G00010" "AP00G00020" "AP00G00030" "AP00G00040" "AP00G00050"
[6] "AP00G00060"
clusters Gene Cluster
1 Chlo_CNC64A_028G00030 1
2 Chlo_CNC64A_028G00040 2
3 Chlo_CNC64A_028G00070 3
4 Chlo_CNC64A_028G00080 4
5 Chlo_CNC64A_028G00110 5
6 Chlo_CNC64A_028G00140 6
[1] 22605
1 2 3 4 5 6
38 38 37 2 9 4
# Get one random cluster of size 10
my_cluster <- clusters %>% dplyr::filter(
Cluster==sample(which(table(clusters$Cluster)==10), 1))
head(my_cluster) Gene Cluster
1 Bpra_BPRRCC1105_18G01170 15993
2 Bpra_BPRRCC1105_18G01180 15993
3 Osp_ORCC809_11G01220 15993
4 Oluc_OL11G00350 15993
5 Omed_OM_11G00390 15993
6 Omed_OM_11G00400 15993
# Remove species unique identifier and add as GeneID
my_cluster$GeneID <- stringr::str_split_fixed(my_cluster$Gene, "_", 2)[, 2]
my_cluster Gene Cluster GeneID
1 Bpra_BPRRCC1105_18G01170 15993 BPRRCC1105_18G01170
2 Bpra_BPRRCC1105_18G01180 15993 BPRRCC1105_18G01180
3 Osp_ORCC809_11G01220 15993 ORCC809_11G01220
4 Oluc_OL11G00350 15993 OL11G00350
5 Omed_OM_11G00390 15993 OM_11G00390
6 Omed_OM_11G00400 15993 OM_11G00400
7 Msp_MRCC299_12G04700 15993 MRCC299_12G04700
8 Mpus_MP10G00340 15993 MP10G00340
9 Otau_OT_11G00400 15993 OT_11G00400
10 Otau_OT_11G00390 15993 OT_11G00390
# Fetch coding sequences
my_cds <- codingsequences[match(my_cluster$GeneID,
names(codingsequences))]
my_cdsDNAStringSet object of length 10:
width seq names
[1] 2859 ATGAAGCCAAAAGTTTCCAAATA...AGAAAATGTAGAAAGAGATTAG BPRRCC1105_18G01170
[2] 1350 ATGAGTTCAATTACGTGTAATAC...TAGGTGTTTCCAAGGATTCTAG BPRRCC1105_18G01180
[3] 2829 ATGTCTAGAGCGACGCGTTTGGA...CGAGAGAAGGGAGAACCCGTAG ORCC809_11G01220
[4] 2985 ATGGAGGCGGACGACGCGGCGGC...AAGCGCCCCAAAGCGCACTTAG OL11G00350
[5] 1371 ATGAAGCGCACCAGGAGCGACGA...TGCCTGCTTTCAAGGATTTTAA OM_11G00390
[6] 2079 ATGCCGTCGCCCGCACCGCGCGC...CGCTCGAGCCGCCGCTCCGTAA OM_11G00400
[7] 1941 ATGGACGAGCTGGGTGGAGGCTG...CGATCGATGCTTCATGTACTGA MRCC299_12G04700
[8] 1509 ATGGGCGGCGACGACGCGACCTC...CGACGCGTGCTTCATGTTCTGA MP10G00340
[9] 1950 ATGTCCACGAGCGAGAATTCTCC...CGGACGCCACTGCGACCGCTAA OT_11G00400
[10] 1152 ATGGCGTCGACGGGCGTGGACGA...TGAATGTTTCCAGGGATTCTAA OT_11G00390
To calculate a coding sequence alignment for two sequences:
# Get coding sequence alignment
cds1_cds2_aln <- cds2codonaln(cds1 = my_cds[1], cds2 = my_cds[2], remove.gaps = FALSE)
cds1_cds2_alnDNAStringSet object of length 2:
width seq names
[1] 2907 ATGAAGCCAAAAGTTTCCAAATA...AAGAAAATGTAGAAAGAGATTAG BPRRCC1105_18G01170
[2] 2907 ATG--------------------...--------------------TAG BPRRCC1105_18G01180
To calculate dN/dS from this alignment:
To calculate all pairwise combinations:
start_time <- Sys.time()
# Calculate dN/dS for the whole cluster; model = YN
my_cds_kaks <- dnastring2kaks(my_cds,
model = "YN", threads = 2, isMSA = FALSE)
end_time <- Sys.time()
head(my_cds_kaks) Comp1 Comp2 seq1 seq2 Method Ka
result.1 1 2 BPRRCC1105_18G01170 BPRRCC1105_18G01180 YN 0.769457
result.2 1 3 BPRRCC1105_18G01170 ORCC809_11G01220 YN 0.493889
result.3 1 4 BPRRCC1105_18G01170 OL11G00350 YN 0.504261
result.4 1 5 BPRRCC1105_18G01170 OM_11G00390 YN 0.916875
result.5 1 6 BPRRCC1105_18G01170 OM_11G00400 YN 0.528609
result.6 1 7 BPRRCC1105_18G01170 MRCC299_12G04700 YN 0.869959
Ks Ka/Ks P-Value(Fisher) Length S-Sites N-Sites
result.1 2.61875 0.293826 4.60612e-14 1299 284.142 1014.86
result.2 4.50708 0.109581 1.64868e-70 1803 407.257 1395.74
result.3 4.54449 0.110961 2.3001e-66 1872 428.082 1443.92
result.4 4.26276 0.21509 1.06185e-13 1257 294.029 962.971
result.5 4.5556 0.116035 4.22173e-70 1857 434.47 1422.53
result.6 4.40849 0.197337 3.79844e-39 1644 357.089 1286.91
Fold-Sites(0:2:4) Substitutions S-Substitutions N-Substitutions
result.1 NA 691 206.618 484.382
result.2 NA 848 344.128 503.872
result.3 NA 882 353.342 528.658
result.4 NA 731 224.582 506.418
result.5 NA 906 367.581 538.419
result.6 NA 973 312.786 660.214
Fold-S-Substitutions(0:2:4) Fold-N-Substitutions(0:2:4)
result.1 NA NA
result.2 NA NA
result.3 NA NA
result.4 NA NA
result.5 NA NA
result.6 NA NA
Divergence-Time Substitution-Rate-Ratio(rTC:rAG:rTA:rCG:rTG:rCA/rCA)
result.1 1.17397 1.24685:1.24685:1:1:1:1
result.2 1.40038 1.1745:1.1745:1:1:1:1
result.3 1.42817 1.16575:1.16575:1:1:1:1
result.4 1.69952 1.07489:1.07489:1:1:1:1
result.5 1.47078 0.907996:0.907996:1:1:1:1
result.6 1.63855 1.17216:1.17216:1:1:1:1
GC(1:2:3) ML-Score AICc Akaike-Weight Model
result.1 0.349329(0.376677:0.27967:0.391641) NA NA NA NA
result.2 0.415121(0.440495:0.330757:0.474111) NA NA NA NA
result.3 0.422399(0.447846:0.334089:0.485261) NA NA NA NA
result.4 0.374579(0.407071:0.305051:0.411616) NA NA NA NA
result.5 0.429662(0.47807:0.353314:0.457602) NA NA NA NA
result.6 0.439264(0.455756:0.354424:0.507612) NA NA NA NA
How long did it take?
Here, a pre-calculated MSA is necessary:
Plot average behavior of each codon:
Kristian K Ullrich @kullrich